# Large Language Model Integration

## VideoRefer-7B

Organization: DAMO-NLP-SG · License: Apache-2.0 · Tags: Text-to-Video, Transformers, English

VideoRefer-7B is a multimodal large language model focused on video question answering; it understands and reasons about spatiotemporal relationships between objects in a video. (A frame-sampling sketch that applies to both video models in this list follows the VideoLLaMA 2 entry.)
## VideoLLaMA2-8x7B-Base

Organization: DAMO-NLP-SG · License: Apache-2.0 · Tags: Text-to-Video, Transformers, English

VideoLLaMA 2 is a next-generation video large language model focused on stronger spatiotemporal modeling and audio understanding; it supports multimodal video question answering and video description.
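Neither VideoRefer nor VideoLLaMA 2 is served by a built-in transformers class; inference typically goes through the DAMO-NLP-SG GitHub repositories. The sketch below shows only the generic preprocessing step both share, uniform frame sampling with OpenCV; `answer_video_question` is a hypothetical stand-in for whichever inference call the chosen repository actually provides, and `demo.mp4` is a placeholder path.

```python
import cv2  # pip install opencv-python


def sample_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip.
    step = (total - 1) / max(num_frames - 1, 1)
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * step))
        ok, frame_bgr = cap.read()
        if ok:
            # OpenCV decodes to BGR; vision models generally expect RGB.
            frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


frames = sample_frames("demo.mp4", num_frames=8)
# Hypothetical call into the model repo's inference API:
# print(answer_video_question(frames, "What is the person on the left doing?"))
```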
## BLIP-2 OPT-6.7b

Organization: merve · License: MIT · Tags: Image-to-Text, Transformers, English

BLIP-2 is a vision-language model that combines an image encoder with a large language model for image-to-text generation and visual question answering.
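The quickest way to try a BLIP-2 checkpoint is the built-in transformers `image-to-text` pipeline. A minimal sketch, assuming the upstream Salesforce checkpoint id (this listing's copy may differ) and enough memory for the full-precision weights:

```python
from transformers import pipeline

# "image-to-text" is a built-in pipeline task; the checkpoint id is the
# upstream Salesforce release, assumed here for illustration.
captioner = pipeline("image-to-text", model="Salesforce/blip2-opt-6.7b")

result = captioner("photo.jpg")  # local path or URL; placeholder here
print(result[0]["generated_text"])
```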
## Heron Preliminary GIT Llama-2 70B v0

Organization: turing-motors · Tags: Image-to-Text, Transformers, Japanese

A vision-language model pretrained on image-text pairs, built on the Llama-2 70B architecture and suited to image caption generation.
## BLIP-2 OPT-6.7b (8-bit)

Organization: Mediocreatmybest · License: MIT · Tags: Image-to-Text, Transformers, English

An 8-bit quantized variant of BLIP-2: a vision-language model that combines an image encoder with the OPT-6.7b large language model for image-to-text generation.
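A pre-quantized checkpoint saves download size, but a similar memory footprint can also be reached by quantizing full-precision weights at load time with bitsandbytes. A minimal sketch, assuming the upstream Salesforce checkpoint id, the bitsandbytes package, and a CUDA GPU:

```python
from transformers import (
    BitsAndBytesConfig,
    Blip2ForConditionalGeneration,
    Blip2Processor,
)

# Quantize the weights to 8-bit on the fly (requires bitsandbytes + CUDA).
quant_config = BitsAndBytesConfig(load_in_8bit=True)

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-6.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-6.7b",
    quantization_config=quant_config,
    device_map="auto",  # places layers across available GPUs/CPU
)
```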
## IDEFICS-80B

Organization: HuggingFaceM4 · License: Other · Tags: Image-to-Text, Transformers, English

IDEFICS-80B is an 80-billion-parameter multimodal model that accepts interleaved image and text inputs and generates text outputs. It is an open-source reproduction of DeepMind's Flamingo model.
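IDEFICS is supported in transformers through `IdeficsForVisionText2Text`, and a prompt is a list that interleaves images with text. A minimal sketch; the image path is a placeholder, and at 80B parameters the weights need multiple GPUs or offloading, which `device_map="auto"` arranges:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, IdeficsForVisionText2Text

checkpoint = "HuggingFaceM4/idefics-80b"
processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

# Each prompt interleaves images (PIL images or URLs) with text.
image = Image.open("photo.jpg")  # placeholder path
prompts = [[image, "In this picture, we can see"]]

inputs = processor(prompts, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```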
## BLIP-2 OPT-6.7b

Organization: Salesforce · License: MIT · Tags: Image-to-Text, Transformers, English

BLIP-2 is a vision-language model based on OPT-6.7b, pretrained with both the image encoder and the language model kept frozen; it supports image-to-text generation and visual question answering.
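For visual question answering, the BLIP-2 OPT checkpoints are prompted with a "Question: ... Answer:" template alongside the image. A minimal sketch using the explicit processor and model classes; the image path is a placeholder:

```python
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-6.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-6.7b", torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("photo.jpg")  # placeholder path
prompt = "Question: how many dogs are in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)

generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```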